
    Exploring signature multiplicity in microarray data using ensembles of randomized trees

    A challenging and novel direction for feature selection research in computational biology is the analysis of signature multiplicity. In this work, we investigate the effect of signature multiplicity on feature importance scores derived from tree-based ensemble methods. We show that looking at the individual tree rankings in an ensemble can highlight the existence of multiple signatures, and we propose a simple post-processing method based on clustering that can return smaller signatures with better predictive performance than signatures derived from the global tree ranking, at almost no additional cost.

    DMFSGD: A Decentralized Matrix Factorization Algorithm for Network Distance Prediction

    Knowledge of end-to-end network distances is essential to many Internet applications. As active probing of all pairwise distances is infeasible in large-scale networks, a natural idea is to measure a few pairs and to predict the others without actually measuring them. This paper formulates the distance prediction problem as matrix completion, where the unknown entries of an incomplete matrix of pairwise distances are to be predicted. The problem is solvable because strong correlations among network distances exist and cause the constructed distance matrix to be low rank. The new formulation circumvents the well-known drawbacks of existing approaches based on Euclidean embedding. A new algorithm, called Decentralized Matrix Factorization by Stochastic Gradient Descent (DMFSGD), is proposed to solve the network distance prediction problem. By letting network nodes exchange messages with each other, the algorithm is fully decentralized and only requires each node to collect and process local measurements, with neither explicit matrix constructions nor special nodes such as landmarks and central servers. In addition, we comprehensively compared matrix factorization and Euclidean embedding to demonstrate the suitability of the former for network distance prediction. We further studied the incorporation of a robust loss function and of non-negativity constraints. Extensive experiments on various publicly available datasets of network delays show not only the scalability and the accuracy of our approach but also its usability in real Internet applications. Comment: submitted to IEEE/ACM Transactions on Networking on Nov. 201
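
The decentralized update described in this abstract can be sketched roughly as follows. This is a minimal illustration, assuming symmetric distances, a squared loss with L2 regularization, and a toy synthetic low-rank matrix in place of real delay measurements. The names `sgd_step` and `simulate`, the step size, the rank, and the neighbour count are hypothetical choices, not the authors' implementation or settings.

```python
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def sgd_step(vec, other, measured, lr, reg):
    """One regularized SGD step on (measured - vec . other)^2 with respect to vec."""
    err = measured - dot(vec, other)
    return [v + lr * (err * o - reg * v) for v, o in zip(vec, other)]

def mean_abs_error(D, X, Y):
    """Mean absolute prediction error over all ordered pairs i != j."""
    n = len(D)
    errs = [abs(D[i][j] - dot(X[i], Y[j]))
            for i in range(n) for j in range(n) if i != j]
    return sum(errs) / len(errs)

def simulate(n=20, rank=3, k=10, rounds=500, lr=0.05, reg=0.01, seed=0):
    rng = random.Random(seed)
    # Toy symmetric low-rank "distance" matrix (synthetic, not real delays).
    A = [[rng.random() for _ in range(rank)] for _ in range(n)]
    D = [[dot(A[i], A[j]) for j in range(n)] for i in range(n)]
    # Each node i stores only its own two coordinate vectors X[i] and Y[i],
    # with the model D[i][j] ~ X[i] . Y[j].
    X = [[rng.random() for _ in range(rank)] for _ in range(n)]
    Y = [[rng.random() for _ in range(rank)] for _ in range(n)]
    # Each node probes a fixed random subset of k neighbours (assumption).
    neighbours = [rng.sample([j for j in range(n) if j != i], k)
                  for i in range(n)]
    before = mean_abs_error(D, X, Y)
    for _ in range(rounds):
        for i in range(n):
            j = rng.choice(neighbours[i])
            d = D[i][j]            # local measurement made by node i
            xj, yj = X[j], Y[j]    # coordinates received in a message from j
            # Node i updates only its own coordinates: no central server,
            # no explicit matrix. Symmetry gives D[j][i] = d ~ X[j] . Y[i].
            X[i], Y[i] = (sgd_step(X[i], yj, d, lr, reg),
                          sgd_step(Y[i], xj, d, lr, reg))
    after = mean_abs_error(D, X, Y)
    return before, after
```

Calling `simulate()` returns the mean absolute error over all node pairs before and after the message exchanges, including pairs that were never measured directly.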

    Automated multimodal volume registration based on supervised 3D anatomical landmark detection

    We propose a new method for automatic 3D multimodal registration based on anatomical landmark detection. Landmark detectors are learned independently in the two imaging modalities using Extremely Randomized Trees and multi-resolution voxel windows. A least-squares fitting algorithm is then used for rigid registration based on the landmark positions predicted by these detectors in the two imaging modalities. Experiments are carried out with this method on a dataset of pelvis CT and CBCT scans from 45 patients. On this dataset, our fully automatic approach yields results that are very competitive with a manually assisted state-of-the-art rigid registration algorithm.

    Classifying pairs with trees for supervised biological network inference

    Networks are ubiquitous in biology, and computational approaches to their inference have been widely investigated. In particular, supervised machine learning methods can be used to complete a partially known network by integrating various measurements. Two main supervised frameworks have been proposed: the local approach, which trains a separate model for each network node, and the global approach, which trains a single model over pairs of nodes. Here, we systematically investigate, theoretically and empirically, the exploitation of tree-based ensemble methods in the context of these two approaches for biological network inference. We first formalize the problem of network inference as classification of pairs, unifying in the process homogeneous and bipartite graphs and discussing two main sampling schemes. We then present the global and the local approaches, extending the latter for the prediction of interactions between two unseen network nodes, and discuss their specializations to tree-based ensemble methods, highlighting their interpretability and drawing links with clustering techniques. Extensive computational experiments carried out with these methods on various biological networks clearly show that they are competitive with existing methods. Comment: 22 page
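
To make the difference between the two frameworks concrete, here is a small sketch of how the training data could be assembled in each case. It assumes a homogeneous, undirected graph, per-node feature vectors, and pair features formed by simple concatenation; the function names and the concatenation choice are illustrative assumptions, and any classifier (for example a tree-based ensemble) could then be fitted on the resulting sets.

```python
def interacts(u, v, edges):
    """True if the undirected pair {u, v} is a known interaction."""
    return (u, v) in edges or (v, u) in edges

def global_training_set(features, edges, nodes):
    """Global approach: a single dataset whose examples are pairs of nodes.

    Each ordered pair (u, v) becomes one example made of the concatenated
    feature vectors of u and v, labelled 1 if the pair is known to interact.
    """
    X, y = [], []
    for u in nodes:
        for v in nodes:
            if u != v:
                X.append(features[u] + features[v])
                y.append(1 if interacts(u, v, edges) else 0)
    return X, y

def local_training_sets(features, edges, nodes):
    """Local approach: one dataset (hence one model) per node.

    For node u, the examples are the other nodes, described by their own
    features and labelled 1 if they are known to interact with u.
    """
    sets = {}
    for u in nodes:
        others = [v for v in nodes if v != u]
        X = [features[v] for v in others]
        y = [1 if interacts(u, v, edges) else 0 for v in others]
        sets[u] = (X, y)
    return sets

# Toy example with made-up features and a single known edge.
features = {"a": [0.1, 0.9], "b": [0.2, 0.8], "c": [0.9, 0.1]}
edges = {("a", "b")}
nodes = ["a", "b", "c"]
X_global, y_global = global_training_set(features, edges, nodes)
local_sets = local_training_sets(features, edges, nodes)
```

In this toy setting the global dataset has six examples with four features each, while each of the three local datasets has two examples with two features each.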

    Optimal model parameters for multi-objective large-eddy simulations

    A methodology is proposed for the assessment of error dynamics in large-eddy simulations. It is demonstrated that the optimization of model parameters with respect to one flow property can come at the expense of the accuracy with which other flow properties are predicted. Therefore, an approach is introduced that allows the total errors based on various flow properties to be assessed simultaneously. We show that parameter settings exist for which all monitored errors are "near optimal," and refer to such regions as "multi-objective optimal parameter regions." We focus on multi-objective errors that are obtained from weighted spectra, emphasizing both large- and small-scale errors. These multi-objective optimal parameter regions depend strongly on the simulation Reynolds number and the resolution. At resolutions that are too coarse, no multi-objective optimal region may exist, as not all error components may be sufficiently small simultaneously. The identification of multi-objective optimal parameter regions can be used to compare different subgrid models effectively. This is illustrated by a comparison between large-eddy simulations using the Lilly-Smagorinsky model, the dynamic Smagorinsky model and a new Re-consistent eddy-viscosity model. Based on the new methodology for error assessment, the latter model is found to be the most accurate and robust among the selected subgrid models, in combination with the finite volume discretization used in the present study.
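
One possible reading of a "multi-objective optimal parameter region" is sketched below. The abstract does not define "near optimal" quantitatively, so the relative tolerance `tol` is an assumption, as are the function name `near_optimal_region` and the error values, which are made-up numbers for illustration only and not results from the paper.

```python
def near_optimal_region(errors, tol=0.25):
    """Return indices of parameter settings where every error is near optimal.

    errors maps an objective name to a list of error values, one per candidate
    parameter setting. A setting is kept if, for every objective, its error is
    within a factor (1 + tol) of the smallest error for that objective.
    """
    n = len(next(iter(errors.values())))
    best = {name: min(vals) for name, vals in errors.items()}
    return [i for i in range(n)
            if all(vals[i] <= (1 + tol) * best[name]
                   for name, vals in errors.items())]

# Hypothetical errors for five candidate values of one model parameter.
toy_errors = {
    "large_scale": [0.30, 0.12, 0.10, 0.11, 0.25],
    "small_scale": [0.40, 0.21, 0.20, 0.30, 0.50],
}
region = near_optimal_region(toy_errors)

# Hypothetical case where the objectives conflict and no region exists.
conflicting = {"large_scale": [0.1, 0.5], "small_scale": [0.5, 0.1]}
empty_region = near_optimal_region(conflicting)
```

With the toy numbers, only the second and third settings keep both errors close to their respective minima, while the conflicting case yields an empty region, mirroring the situation described for overly coarse resolutions.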

    Context-dependent feature analysis with random forests

    Feature selection is often more complicated than identifying a single subset of input variables that would together explain the output. There may be interactions that depend on contextual information, i.e., variables that turn out to be relevant only in some specific circumstances. In this setting, the contribution of this paper is to extend the random forest variable importances framework in order (i) to identify variables whose relevance is context-dependent and (ii) to characterize as precisely as possible the effect of contextual information on these variables. The usage and the relevance of our framework for highlighting context-dependent variables are illustrated on both artificial and real datasets. Comment: Accepted for presentation at UAI 201

    Ensembles of extremely randomized trees and some generic applications

    Peer reviewed. In this paper we present a new tree-based ensemble method called "Extra-Trees". This algorithm averages the predictions of trees obtained by partitioning the input space with randomly generated splits, leading to significant improvements in precision and to various algorithmic advantages, in particular reduced computational complexity and scalability. We also discuss two generic applications of this algorithm, namely time-series classification and the automatic inference of near-optimal sequential decision policies from experimental data.
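
The idea of averaging trees grown with randomly generated splits can be sketched in a few lines for regression. This is a simplified illustration, assuming uniform random cut-points, a variance-reduction score to choose among `k` random candidate splits, and fully grown trees; the parameter names `k` and `n_min` and all defaults are assumptions of this sketch rather than the authors' reference implementation.

```python
import random

def variance(ys):
    m = sum(ys) / len(ys)
    return sum((v - m) ** 2 for v in ys) / len(ys)

def build_tree(X, y, rng, k=1, n_min=2):
    """Grow one tree; a leaf is a float, an internal node a 4-tuple."""
    if len(y) < n_min or min(y) == max(y):
        return sum(y) / len(y)
    # Only attributes that still vary in this node can be split on.
    candidates = [a for a in range(len(X[0]))
                  if min(r[a] for r in X) < max(r[a] for r in X)]
    if not candidates:
        return sum(y) / len(y)
    best = None
    for a in rng.sample(candidates, min(k, len(candidates))):
        lo, hi = min(r[a] for r in X), max(r[a] for r in X)
        cut = rng.uniform(lo, hi)  # cut-point drawn at random, not optimized
        left = [i for i, r in enumerate(X) if r[a] < cut]
        right = [i for i, r in enumerate(X) if r[a] >= cut]
        if not left or not right:
            continue
        score = variance(y) - (len(left) * variance([y[i] for i in left]) +
                               len(right) * variance([y[i] for i in right])) / len(y)
        if best is None or score > best[0]:
            best = (score, a, cut, left, right)
    if best is None:
        return sum(y) / len(y)
    _, a, cut, left, right = best
    return (a, cut,
            build_tree([X[i] for i in left], [y[i] for i in left], rng, k, n_min),
            build_tree([X[i] for i in right], [y[i] for i in right], rng, k, n_min))

def predict_tree(node, x):
    while isinstance(node, tuple):
        a, cut, left, right = node
        node = left if x[a] < cut else right
    return node

def fit_forest(X, y, n_trees=25, k=1, n_min=2, seed=0):
    """Every tree sees the whole training set; only the splits are random."""
    rng = random.Random(seed)
    return [build_tree(X, y, rng, k, n_min) for _ in range(n_trees)]

def predict(forest, x):
    """Ensemble prediction: the average of the individual tree predictions."""
    return sum(predict_tree(t, x) for t in forest) / len(forest)
```

For instance, fitting `fit_forest` on points sampled from a one-dimensional step function and calling `predict` on a new point returns the average of the leaf values reached in each randomized tree.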